spike(workflow): concurrent dynamic dispatch harness (DynamicNodeSupervisor + ctx.pipeline)#3
Draft
caohy1988 wants to merge 64 commits into
Draft
spike(workflow): concurrent dynamic dispatch harness (DynamicNodeSupervisor + ctx.pipeline)#3caohy1988 wants to merge 64 commits into
caohy1988 wants to merge 64 commits into
Conversation
8e8905f to
fd3bd52
Compare
caohy1988
added a commit
that referenced
this pull request
Jun 8, 2026
Caught in review of #6: the C7 pair keys (pause_kind, function_call_id) were being passed via EventData.extra_attributes, which _enrich_attributes() copies at the top of attrs *before* attrs["adk"] = _build_adk_envelope(...). That landed them at attributes.pause_kind / attributes.function_call_id, not attributes.adk.pause_kind / attributes.adk.function_call_id. The customer SQL pinned in google#293 v5 acceptance #3 is: JSON_VALUE(attributes, '$.adk.function_call_id') = JSON_VALUE(...) so the pair join would have returned null on every row. This commit makes the contract match the SQL. Changes: * EventData gains adk_extras: dict[str, Any], a sibling of extra_attributes that lives INSIDE attributes.adk. * _enrich_attributes merges adk_extras into the envelope after _build_adk_envelope (envelope wins on conflict — producer-derived identity fields like source_event_id are the source of truth). * The two emit sites (TOOL_PAUSED in on_event_callback, TOOL_COMPLETED in on_user_message_callback) pass the pair keys via adk_extras= instead of extra_attributes=. * The three C7 tests are updated to assert json.loads(row["attributes"])["adk"]["pause_kind"] etc., locking in the right shape this time. Full plugin suite: 252 passed.
4 tasks
caohy1988
added a commit
that referenced
this pull request
Jun 9, 2026
…ee-authoring beat Implements the three maintainer-review items that strengthen gate evidence (the rest stay RFC-design-level): * Contract-hash drift (review #3): Capability.contract_hash() = sha256(input_kind + output schema), recorded per referenced capability in FrozenWorkflowRecord; import rejects contract drift even when the manual version string was never bumped (manual versions demoted to secondary). * Auditable lint suppression (review #6): allow_self_chain capability policy opts legitimate draft->critique->redraft refinement out of the self-review lint; per-plan lint_waivers (node id -> justification) suppress lints AND are recorded in the frozen record/export envelope. * Free-authoring beat (demo review): 'freely/decompose' trigger gives the planner ONLY goal + capability descriptions (no recipe) — the honest model-authored claim; recipe-free instruction pinned by a CI test. Suites: 11 + 34 + 8 = 53 deterministic tests.
…rvisor + ctx.pipeline) RFC google#92 reference harness. DynamicNodeSupervisor (gate-on-leaf, TaskGroup fan-out) + ctx.pipeline/ctx.parallel on the real ADK Workflow engine. 11 deterministic CI-safe tests (no LLM) + an env-gated live Gemini E2E. Proves barrier-free execution, failed-item isolation, control-exception cancellation (requires TaskGroup, not gather), nested no-deadlock with leaf gating (+ driver-gating deadlock contrast), and resume exactly-once for completed children. pyink/isort/mdformat clean.
A model emits a declarative, validated WorkflowSpec (typed data, not code) that the framework validates and executes on the real ADK engine via the google#92 supervisor. authoring.py (WorkflowSpec plain kind-tagged recursive union; CapabilityRegistry; WorkflowSpecValidator; SpecInterpreter for step/fan_out/ branch/loop_until), 10 deterministic tests, env-gated live planner sweep (multi-stage/branch/loop on gemini-3.5-flash, shape-specific assertions). Findings folded into the RFC: open dict[str,X] maps are a structured-output hazard (Branch.routes -> list[Route]); Gemini response_schema rejects Field(discriminator=...), so the vocabulary is a plain kind-tagged union; planning vs capability quality are separable. pyink/isort/mdformat clean.
A discoverable `root_agent` (a Workflow) that exposes the google#93 flow in ADK Web: the model authors a typed WorkflowSpec, ADK validates it against the capability registry, freezes spec+hash to session state, and executes it on the real engine via the google#92 supervisor — each step surfaced as a chat message so the authored plan, validation, capabilities, frozen hash, and final output are all visible in the ADK Web chat / State / Events surfaces. adk web contributing/samples/workflows/authored_workflow_demo Load-or-author: if a frozen spec already exists in the session it is REUSED (planner not re-invoked) and replayed — so the resume/reproducibility claim is real, verified by a CI-safe no-LLM test (test_demo_agent.py: import + name + registry + spec validation + reuse-path-with-stub-registry). Reuses the authored_workflow_spike/ stack; model from env (default gemini-2.5-flash; gemini-3.5-flash needs location=global); no hardcoded project. README is the ~7-min recording script. pyink/isort/mdformat clean; 4 demo tests pass.
…rage tiers) Standalone design for the authored-workflow spike: data model, validator, interpreter, frozen-spec contract, security model, testing, empirical findings, and the plan export/storage tiers (v1 per-run persist; v1.1 portable JSON export envelope; v2 reusable templates with import-time registry revalidation; compiled Workflow is a derived artifact, never the stored source of truth).
…a, import-input rule DESIGN.md §10: spec_hash/task_input_digest defined as sha256 over canonical JSON; envelope carries an optional task_input_schema; import contract — digest is advisory provenance for replaying the original run, template reuse validates a new task input against task_input_schema, else replay-only on matching digest or explicit template promotion. Never silently bind a stored plan to an incompatible task shape.
One FrozenWorkflowRecord backs session state, audit event, and export envelope
(§5/§10) — v1 persists the full record under authored_workflow:frozen_record,
not a weaker {spec,hash} subset. import_plan recomputes spec_hash (reject on
mismatch) and re-runs validation against the current registry rather than
trusting envelope.validation; replay vs template execution-input rule made
explicit. 'discriminated-by-kind' -> 'plain kind-tagged union' wording fix.
…-first, 3-5-task build gate)
Demo persists only {spec, hash}; production v1 stores the full
FrozenWorkflowRecord (DESIGN.md §5). State it explicitly in DESIGN.md, the demo
agent.py freeze step, and DEMO_NARRATIVE.md so the demo isn't misread as the
canonical persistence contract.
…aude Code comparison Pipeline/PipelineStage make barrier-free multi-stage per-item flow first-class so the authoring vocabulary is not less expressive than its google#92 executor. Add a candid 'Comparison to Claude Code Dynamic Workflows' (wins: audit/safety; gaps: expressiveness/maturity, plan-size ceiling, quality-pattern templates, scale). Hierarchical/sub-plan authoring noted as post-gate future, not MVP.
authoring.py now covers step/fan_out/pipeline/branch/loop_until. Pipeline + PipelineStage compile to google#92 ctx.pipeline: each item threads ALL stages barrier-free (item A in stage k while item B in stage 1) — NOT two barriered fan_outs. Validator: over is a list, every stage capability exists and takes an item, stage input scope. Interpreter: stage[0] input defaults to the per-item element, stage[n] to stage[n-1] output; collect=list. 3 new deterministic tests (13 total): validator accept/reject pipeline, and an ordered + BARRIER-FREE proof (a verifier starts before the slow reviewer finishes — impossible with two barriered fan_outs).
P2: each Pipeline stage dispatches once per item, so every stage capability is subject to the same data-dependent fan_out cap as FanOut. The interpreter now rejects (pre-dispatch) when len(items) > stage cap.max_fan_out, closing a gap where a pipeline over N items bypassed a capability's cap — the RFC security model relies on runtime enforcement of these caps. New deterministic test (14 total) asserts rejection before any stage runs. P3: sync stale shape lists (add pipeline) and test counts after the prior Pipeline addition — authoring 10/13 -> 14, totals 11+14+4 = 29, across authoring.py, test_authoring.py, both READMEs, DESIGN.md, DEMO_NARRATIVE.md.
…y-audit demo The demo now exercises Pipeline — the construct that closes the Claude Code gap — without adding visual complexity. The planner authors pipeline -> step -> step: a reviewer->verifier pipeline over the files (each file reviewed then its finding verified, barrier-free per item), then triager, then formatter. - agent.py: add a 'verifier' capability (Finding -> confirmed Finding); planner instruction authors the pipeline; capability collection walks pipeline stages so the displayed/audited capability set includes stage caps. - test_demo_agent.py: demo spec + stub registry use Pipeline/verifier; reuse path still proves no-LLM frozen replay. - Narrative + README: Beat 1 plan is pipeline -> step -> step; Beat 4 calls out reviewer/verifier interleaving per file; fresh captured hash 1f4c0883beb6. Validated live on gemini-3.5-flash: planner authored the pipeline 3/3 trials (no flakiness); reuse replays the same hash without re-invoking the model.
…e registry The first chat message hardcoded (reviewer, triager, formatter) and so contradicted the validation list after verifier was added. Derive it from CapabilityRegistry.names() (new accessor) so the recording can't drift from the registered set again.
…plan + demo beat Makes the frozen plan a first-class, portable artifact (DESIGN.md §10), so the RFC's enterprise claim (reviewable / diffable / replayable model-authored plans) is real, not paper. authoring.py: - FrozenWorkflowRecord (the single §5 shape) + ValidationResult; FrozenWorkflowRecord.freeze() captures spec_hash, planner_model, registry + per-capability versions, validation, task_input_digest. - canonical_json / sha256_hex — the one fixed hash definition (sort_keys + tight separators) so two exporters agree. - export_plan(record) -> dict; import_plan(envelope, registry, task_input=None) that NEVER trusts the envelope: recompute sha256 (reject tamper), re-validate vs CURRENT registry (reject dropped capability), reject per-capability version drift; execution-input contract (replay needs matching input digest; template needs task_input_schema). - Capability.version + CapabilityRegistry.capability_versions() for drift detection; referenced_capabilities() walker. test_authoring.py (+5 -> 19): round-trip replays same hash; tamper rejected; dropped capability rejected; version drift rejected; new input without template schema rejected (and accepted once a schema is attached). demo: an 'Export plan' beat writes the full envelope to security_audit_plan.json and re-imports it (proving defensive import). Unifies the displayed hash on the canonical definition. Narrative/README gain Beat 3b; counts 11+19+4 = 34. Generated envelope + demo session dbs gitignored. Validated live on gemini-3.5-flash: export beat writes a complete envelope, re-import passes, reuse replays the same hash.
…sort P1: isort ordering in test_authoring.py imports (PlanImportError after PipelineStage) — pre-commit clean. P2: import_plan now hard-errors on registry_version drift (envelope vs current registry.version), matching DESIGN.md §10 'registry-version match … drift = hard error'. Previously only dropped capabilities + per-capability versions were checked. P3: import_plan rejects an unsupported schema_version (only 'v1' supported) — a defensive importer refuses formats it can't read. +2 deterministic tests (-> 21): registry-version drift and unsupported schema_version both rejected. Counts 11+21+4 = 36.
…bservability Adds DESIGN §11 'Convergence with ADK AgentConfig' (renumbers Future -> §12) in response to reviewer questions on issue google#93: - Lower the static subset (sequence/parallel/loop) to ADK's Sequential/Parallel/ LoopAgentConfig instead of reinventing serialization; keep branch/fan_out/ pipeline as new types only because config can't express them (static sub_agents resolved once at load; no ConditionalAgent; needs google#92 ctx.pipeline). - Why the planner does NOT emit raw AgentConfig: static graph; Discriminator union rejected by response_schema; FQN tool/agent/callback refs (importlib, no allow-list) re-open the code-exec surface the declarative+allow-list model closes. - Q1 storage: FrozenWorkflowRecord in session State + audit event + export envelope. - Q2 custom tools: registered capability by registry name (allow-list), not FQN. - Q3 version/observability: spec_hash + registry/capability versions -> drift rejected on import; compiled Workflow runs on the real engine so ADK tracing applies. All claims source-verified against agents/agent_config.py and config_agent_utils.py.
Address review on the convergence section: - 'design converges / should lower', not 'now lowers' — the spike does not yet implement an AgentConfig-lowering compiler (explicit caveat added). - precise table: static parallel block -> ParallelAgentConfig; runtime fan_out/pipeline/branch have no direct config equivalent (ParallelAgentConfig is static parallel sub-agents, not data-mapping over a runtime list). - soften FQN wording to a trust-boundary mismatch (FQN imports are fine for developer-authored config; the concern is a MODEL authoring raw FQNs), not 'config is unsafe'.
Tie the demo to RFC google#93 §11 without overclaiming: this plan's top-level sequence is the kind of static shape that should lower to SequentialAgentConfig, while the reviewer->verifier pipeline (per-item over a runtime list) is exactly what AgentConfig can't express. Explicitly notes this is a design direction — the demo runs via SpecInterpreter and does NOT lower to AgentConfig (no such compiler in the spike). README section + a presenter aside in the narrative.
…allback 'give me the best chart for revenue by region' matched the sequence trigger 'revenue by region' before the tournament trigger 'best chart' (first-match in definition order) and replayed the frozen Q&A plan. The sequence scenario is the generic fallback for ANY question, so the router now checks all specialized scenarios first and falls back to sequence only when none match. Overlap cases pinned in the routing test (incl. the advertised prompt 7, which had the same latent collision).
…ifacts BigQuery Conversational Analytics returns Vega-Lite chart specs; the demo now does too. New deterministic render_chart capability (no plotting deps): input-tolerant (query rows dict, raw rows list, tournament winner list, or a bare chart-type string) -> chart_type + Unicode bar preview (renders in the ADK Web chat) + Vega-Lite spec. Wired into the sequence recipe (chart the query rows) and the tournament recipe (render the data with the winning mark); a 📈 beat surfaces any chart artifact found in interpreter state. 15 -> 16 CA tests, 71 total.
…ne) + inline chart images Two upgrades from live dogfooding: * run_query is now a deterministic micro-warehouse: a synthetic 24-month x 4-region x 4-category fact table + SQL-intent parsing. The executor AGGREGATES the facts per the query's grouping (month/region/category), window (INTERVAL N YEAR/QUARTER/MONTH), filters (country/region/category literals), and measure alias (SUM(...) AS name) — replacing the canned keyword rows whose shape couldn't answer a trend question (live gap: a monthly-trend SQL got region rows back). Honest scope documented: intent execution, not SQL parsing; real BigQuery is the production step. * render_chart infers a LINE mark for date-shaped x labels (explicit tournament winners still win); _chart_png renders the artifact to PNG via matplotlib (optional) and the 📈 beat emits it as an inline image part — ADK Web shows a real chart; falls back to the Unicode preview. 18 -> 21 CA tests (engine windows/trend-alias-filter/total/category, line inference, PNG magic bytes, derived encoding); 76 total.
…s for all time buckets Live gap #2: 'sales trend for global based on 3 years' authored EXTRACT(YEAR ...) AS year GROUP BY year — the engine only knew monthly grouping, fell through to a single anonymous grand total, and charted one '?' bar. Fixes: * Time grains month/quarter/year (week -> month) detected from DATE_TRUNC(..., G), EXTRACT(G FROM ...), AS g aliases, and the GROUP BY clause — scoped to the actual clause (stop at ORDER BY/LIMIT, INTERVAL phrases stripped, so a trailing 'INTERVAL 1 YEAR' window never reads as a yearly grouping). * Monthly facts bucket into the requested grain (quarters as YYYY-Qn, years as YYYY); quarters sum exactly to their year. * Line-mark inference covers monthly, quarterly, and yearly labels (>= 2 points; a single total stays a bar; explicit winners still win). Pinned: the exact live yearly SQL, quarterly buckets, bucket-consistency, label inference. 21 -> 22 CA tests, 77 total.
…t + multi-series charts Replaces semantic guessing with the real thing, per review feedback: * dry_run -> the actual BigQuery dry-run API (real errors, real bytes-scanned); run_query -> real execution against bigquery-public-data.thelook_ecommerce billed to GOOGLE_CLOUD_PROJECT, with safety rails: maximum_bytes_billed = 2 GB/query, 500-row cap, result cells JSON-ified (Decimal/date/datetime). Bare table refs are auto-qualified and backticked. Fallback to the deterministic micro-warehouse without credentials or with CA_DEMO_USE_BIGQUERY=0; every beat carries an engine field (bigquery/mock/mock-fallback) so the data source is never misrepresented. * Multi-series charts: x/series/measure derived from the result shape (time fields by name/value, a second categorical becomes one line per value, measures picked by name preference — an int year column is never mistaken for the measure); Vega-Lite color encoding + matplotlib multi-line PNG with legend; ascii preview field-aware and capped. 22 -> 27 CA tests (qualify, jsonify, no-credentials fallback, multi-series pivot, and a live-gated real-BigQuery round-trip — dry-run bytes, real error, real rows; passing locally). 81 total.
…rrors; no fabricated fallback Live finding: nl2sql emitted TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 YEAR) — a genuine BigQuery dialect error the real dry-run caught — but (a) the linear sequence plan ignored the verdict and ran the query anyway, and (b) the executor's no-credentials fallback fired on the FAILING query and served mock rows as an answer. * The ask-a-question recipe is now the real CA flow: loop_until( draft_or_repair_sql reading the loop-carried failed dry-run output -> REAL dry_run) -> run_query -> chart + summarize. The model repairs from the actual BigQuery error text. * _execute_sql no longer fabricates: a failing query returns empty rows + the real error (engine: bigquery); the mock engine is reserved for missing credentials only. * Tests: expected sequence shape updated; no-fabrication contract pinned with a raising fake client; plan-executing e2e tests pin the mock engine (no network in unit tests). 27 -> 28 CA tests, 82 total.
Frozen plans now outlive the session. On freeze, the FULL FrozenWorkflowRecord (spec, hashes, planner/registry/capability versions + contract hashes, captured task_input_schema = template promotion) is exported to ca_plan_store/<scenario>.json — the demo's stand-in for the ArtifactService in the RFC's revised Q1. A new session's reuse order is: session state -> plan store -> author fresh. The store path uses the RFC's DEFENSIVE import: spec_hash recomputed, re-validation against the CURRENT registry, manual-version + contract-hash drift are hard rejections (shown in chat, then fresh authoring — drift never silently replays a stale plan), and a new question is validated against the captured schema (cross-session template reuse). Tests: store round-trip + template reuse with a new question, tamper (hash-mismatch) rejection, contract-drift rejection with an unbumped version. 28 -> 30 CA tests, 84 total.
…gh repair loop, declared-output contracts 1. Blocker: isort import order in agent.py (pre-commit now clean locally). 2. High: the repair-round claim is now real — Sql echoes the question (schema field + instruction) and _bq_dry_run preserves it through every branch, so after a FAILED real dry run the loop-carried value holds question + sql + error, not sql + error. Pinned with a fake-client failure test asserting the question survives. 3. Medium: deterministic capabilities (dry_run, run_query, render_chart, profile_table, quality_report, describe_schema, keep_verified, flaky) now declare output models, so their contract hashes cover real output schemas; plan-store wording softened to declared-contract drift. 4. Medium: PR body refreshed (real BigQuery, plan store, counts). 30 -> 31 CA tests, 85 total.
…n manual versions Review precision item: sql_ok / pair_charts / judge_chart / single_chart return bare bool/str/list values (wrapping them in declared models would change runtime shapes their consumers depend on), so their contract hashes cover input_kind only. README and PR wording now scope the claim to the typed object-output capabilities; test-count wording fixed (31 collected = 30 CI-safe + 1 live-gated).
…rectly, no workflow Live finding: 'tell what kinds of workflow you can issue?' fell through the trigger router to the data scenario; the frozen plan replayed and NL2SQL turned a question about the AGENT into a query about order statuses. The root agent now has a front door — the RFC's no-plan escape hatch (DESIGN §12) implemented: untriggered messages are intent-classified (data | meta | chat). Meta/chat turns get a direct conversational reply (a catalogue of the seven workflow shapes, built from SCENARIOS so it never drifts) at the cost of one intent call — no planner, no queries. Data questions and trigger prompts proceed unchanged. Gate instruction and catalogue are template-safe (brace-free), pinned by test alongside the routing split (triggered bypasses gate; the exact live meta-question gates). 31 -> 32 CA tests, 86 total.
Review items: (1) pin the FULL no-workflow escape path — with a stubbed intent gate returning meta, the root agent's run yields the gate reply and the conversation output and returns BEFORE plan-store import, session replay, authoring, or execution (empty plan store asserted untouched; no authored/reused/validation beats in the transcript). (2) PR body counts synced (33 collected = 32 CI-safe + 1 live-gated; 88 total collected).
…rated, then canned
Live finding: 'audit this insight <X>' selected the adversarial mode but
audited the CANNED demo insights — the user's claim was discarded (mode
selectors kept canned task inputs). The audit scenario now resolves its
insights in order: (1) inlined in the message ('audit this insight: X',
';'/newline-separated lists, typo-tolerant filler), (2) the session's last
generated insight ('audit that insight' after a question — each sequence
result is remembered in state), (3) the canned demo set as final fallback.
The banner states which source is being audited. Frozen plan unchanged —
new insights are template reuse through the same fan_out(skeptic) plan.
33 -> 34 CA tests, 88 total.
…dit beat Live finding: the skeptic step was a raw event blob with a bare (insight, refuted) pair — no reasoning, easy to miss, and with one insight only one skeptic runs so there was 'nothing to see'. * Verdict gains a required-by-instruction reason field — the skeptic must show its work (what it checked, why the claim stands or falls, caveats like partial years). * New audit beat renders every verdict found in interpreter state as one line: REFUTED / upheld — insight — reason. * _verdict_of carries the reason (tolerant of JSON-string verdicts); insight extraction strips trailing punctuation from user phrasing. 34 -> 35 CA tests, 89 total.
…insight User question: does the skeptic fan-out really dispatch isolated agents, or do they read the conversation history? Settled empirically: a spy on the model layer (before_model_callback short-circuits, no network) captures each fanned-out skeptic's actual LLM request. Asserts: one real dispatch per insight; each request contains exactly its own insight — not the sibling's insight, not a planted prior chat beat, not the user's turn message. ADK's workflow wrapper runs plain agents in single_turn mode with per-dispatch isolation scoping, so the model call gets node_input only. This is the runtime half of the independence story (the binding lints are the static half). 35 -> 36 CA tests, 90 total.
…olation (tool-loop fix) The data-grounded skeptic (a real BigQuery verification tool on the adversarial reviewer) exposed a REAL isolation bug, fixed at the source: * fix(src/workflow): prepare_llm_agent_context dropped the dispatch's isolation_scope — Context only inherits it via parent_ctx, which is not passed — so the single_turn input event was appended UNSCOPED. A fanned-out agent making MULTIPLE model calls (tool loop) rebuilds its contents per call and picked up the LATEST sibling's input instead of its own: parallel tool-using skeptics all answered the last claim (single-call agents were correct only by timing). The agent context now carries the scope; verified offline (spy-driven simulated tool loop) and live (per-claim verdicts with real queried evidence). * spike(google#92 supervisor): every dispatch now runs in its own sub-branch and its own isolation scope (parent_scope::run_id) — per-dispatch context independence is structural for multi-call children. * feat(ca-demo): the skeptic is DATA-GROUNDED — output_schema + a real query_thelook tool (capped, engine-honest); instruction demands queried evidence in the verdict reason; capability version bumped to 2 so stored plans drift-reject instead of silently reusing the plausibility-only skeptic. Live verification: the $1M-AOV claim is REFUTED with the actual computed ~$86 AOV; the Shipped-status claim upheld with real counts. * tests: tool-loop sibling-isolation regression (deterministic spy FC, no network; gated on the patched wrapper so stale local installs skip); single-call isolation assertions made version-tolerant; grounded-skeptic config + tool pins. CA suite 38 collected; full suite 92 under the patched tree, 91+skip under a stale ADK.
…tated HTML page Reads every FrozenWorkflowRecord envelope in ca_plan_store/ and writes a self-contained plan_inspector.html: the plan's dataflow as a diagram and every envelope field annotated with the guarantee it delivers — spec_hash (tamper evidence), planner_model/created_at (authoring provenance), registry + capability versions and derived contract hashes (drift detection), task_input_schema/digest (cross-session template reuse), validation (recorded, never trusted). The on-camera artifact for the "plans as durable, auditable data" story.
The banner still said 'over mock thelook_ecommerce' — stale from before the real-BigQuery upgrade and misleading on camera. It now reports LIVE bigquery-public-data.thelook_ecommerce when credentials allow, or the mock warehouse fallback, matching the engine field on every result.
…ES__ metadata Live finding: the demo listed only 4 of thelook_ecommerce's 7 tables, and profile_table still returned CANNED values chosen to look plausible — exactly the kind of fabrication the engine-honesty rule exists to prevent. * TABLES now covers the full real dataset (adds events, inventory_items, distribution_centers) — also widens nl2sql coverage to event/inventory questions. * profile_table queries the real __TABLES__ metadata (row_count, size_mb; cheap metadata reads) with engine=bigquery, falling back to clearly labeled canned values without credentials. Live-verified: events 2.43M rows / 368MB, distribution_centers 10 rows. * TableProfile/QualityReport contracts updated (row_count + size_mb; report = totals + largest table) — a DELIBERATE contract change: stored fanout plans drift-reject and re-author rather than running stale semantics. Fanout e2e pins all seven tables via the mock engine.
…ve path User ask: make the whole demo run on the real thelook_ecommerce dataset. Remaining fakery removed: * describe_schema v2: a data-grounded agent (query_thelook tool) that answers metadata questions by querying the REAL data (DISTINCT values, counts) — live-verified answer cites shipped_at/delivered_at semantics it discovered itself. The canned _SCHEMA_NOTES sentence is gone. * The repair scenario repairs a REALLY broken query: the task carries SQL referencing thelook_ecommerce.order (not orders); the REAL dry-run rejects it, the repair step fixes it from the actual BigQuery error, and run_query returns real rows (live-verified: China $364K...). The simulated flaky_dry_run capability and _FLAKY_CALLS are deleted — transient-failure simulation now lives only in the CI test stub. * render_chart: a bare tournament winner now charts REAL revenue-by-region rows fetched from BigQuery (canned rows remain only inside the no-credential mock engine). * Chart/loop/branch/tolerance tests pin the mock engine explicitly (no network in unit tests); the branch e2e stubs the now-LLM describe_schema. The only non-real component left BY DESIGN is the no-credential fallback warehouse, always labeled via the engine field. 38 CA tests collected; 91 green locally (92 under the patched ADK tree).
…the stray) The console's Agents Hub shows 8 tables; the demo's hand-written catalogue listed 7. The 8th is real: 'thelook_ecommerce-table', an empty stray placeholder (0 rows, dashed name) in the public dataset. The profiling fan-out now discovers the table list live from __TABLES__ (cached per process; curated-catalogue fallback without credentials), so it always matches what the console shows — and surfacing the empty stray is itself an honest data-quality finding. Profile-name validation now permits dashed table names. 39 CA tests collected.
Matches the production CA agent's 7-table scope: the 0-row 'thelook_ecommerce-table' placeholder is excluded from profiling fan-outs.
…/run/chart) Found via the production-CA head-to-head: the shipped pipeline scenario only translated+validated panels; the production agent executed them. Our vocabulary already expresses the full shape — pipeline stages draft_or_repair_sql -> REAL dry_run -> run_query -> render_chart, per panel, barrier-free. Live result: 3 real panels (12-row revenue line, top-5 categories bar, 60-row multi-series users-by-traffic-source line) in 10.0s / 12 dispatches against the real dataset. Panel questions updated to concrete 2025 asks; e2e pins 3 executed chart artifacts.
…visions The CA head-to-head showed our determinism was plan-level only — the drafting LLM still varied date windows run-to-run. SQL freezing extends the freeze one level deeper, and human feedback makes the artifact governable: * After a question's SQL passes the REAL dry-run, it is frozen to ca_plan_store/sql/<question-digest>.json (sql, sha256, engine, bytes, validated_at, revisions[]). Re-asking the exact question replays via a STATIC replay plan (dry_run -> run_query -> chart -> summarize; constant hash, no drafting step): the dry-run re-validates (warehouse-drift detection) and the numbers are deterministic. Live-verified: identical $644,971.22 across runs with the drafting LLM skipped. * 'revise: <feedback>' revises the frozen SQL for the session's last question (draft capability is now feedback-aware), MUST pass the real dry-run before replacing the artifact, and records the feedback + previous SQL in the revisions history — an auditable trail of who changed the query and why. Live-verified: 'exclude Cancelled/Returned' -> validated revision -> re-frozen -> executed (China $644K -> $480K). * Drift safety: a frozen SQL that no longer validates falls back to fresh authoring with the rejection shown. Tests: store roundtrip + normalized digests + revision history, static replay plan (constant hash, lint-clean, no drafting capability), e2e replay with 4 dispatches (no draft), revise-trigger routing. 95 total.
…istory The 🧊 cards show the pinned statement (replays skip the drafting LLM), sql_hash/validated_at/engine, and every human-feedback revision with the preserved previous SQL — the numeric-determinism and governance benefits made visible alongside the plan envelopes.
plan_inspector.py now takes an optional session id (plus app/user) and reads the actual session from the running ADK server: each user turn is rendered as a timeline card classified by the mechanism that answered it (frozen-SQL replay / human revision / authored fresh / frozen-plan reuse / conversation), with the sql hash, applied revision count, and the resulting insight — so the page tells the story of the demo run on screen, with the artifacts those turns created and reused directly below. python .../plan_inspector.py <session-id> [app] [user]
…frozen middle results The page now pitches RFC google#93 with the demo as evidence, not the demo itself. The model-authored typed plan is the centerpiece; the validated intermediate results its steps produce (the dry-run-checked SQL, in this instance) render INSIDE the workflow card, attached to the step that produced them — the drafting loop carries a 'result frozen, SKIPPED on replay' badge. Session timeline beats use RFC vocabulary (model authored the workflow / frozen-workflow replay / step-result replay / human-governed revision), and a closing card generalizes the mechanism beyond SQL (retrieved schemas, verified claims, chart specs) into the RFC's freezing tiers v1 / v1.1 / v1.2 / v2.
Alignment check against a live session showed the page mixed THIS session's frozen middle result with artifacts from other sessions (the store is global by design). With a session id, middle results now filter to questions actually asked (or revised) in that session; artifacts from other sessions collapse into a labeled details card — the page mirrors the demo run on screen while still showing the cross-session store.
User question exposed the gap: middle results were stored BESIDE the frozen workflow (linked only by the inspector's heuristic), not structurally attached to it. The artifact now records plan_hash and produced_by_step at freeze time; revisions inherit the lineage. Design kept deliberate: the plan envelope stays a question-agnostic TEMPLATE (embedding per-question instances would bloat and conflate it) — middle results are per-task-input instances that REFERENCE their plan. The inspector attaches artifacts to workflow cards by recorded lineage (heuristic fallback for pre-lineage records) and renders the lineage on each card. Lineage pinned by tests incl. revision inheritance.
918e81c to
b41323a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
1.
dynamic_supervisor_spike/(co9Prototype
DynamicNodeSupervisor(gate-on-leaf,TaskGroupfan-out) +ctx.pipeline/ctx.parallelon the real ADK engine.supervisor.py,test_dynamic_supervisor_spike.py(11 deterministic tests),test_live_gemini_e2e.py(env-gated),README.md.TaskGroup, notgather), nested no-deadlock with leaf gating (+ driver-gating deadlock contrast), resume exactly-once for completed children.2.
authored_workflow_spike/(agent-authored typed Workflows)A model emits a declarative, validated
WorkflowSpec(typed data, not code) executed on the real engine via the supervisor.authoring.py(WorkflowSpecplainkind-tagged-union tree,CapabilityRegistry,WorkflowSpecValidatorincl. plan-quality lints,SpecInterpreterfor step/fan_out/pipeline/branch/loop_until with loop-carriedLoopUntil.init+dispatch_count, plusFrozenWorkflowRecord/export_plan/import_plan, derived contract-hash drift (primary signal, fail-closed on stripped hashes; manual versions secondary), auditable lint waivers (allow_self_chainpolicy + per-plan waivers recorded in the frozen record), andindependence_facts),test_authoring.py(36 deterministic tests, incl. a barrier-free pipeline proof, per-stagemax_fan_outenforcement, plan export/import round-trip + tamper / capability+registry version drift / schema_version checks, ADK-config lowering of the static subset, six-coordination-pattern coverage — adversarial verification + tournament — and the plan-quality lints),test_live_planner_sweep.py(env-gated; multi-stage/branch/loop ongemini-3.5-flash),DESIGN.md(canonical technical design incl. plan export/storage tiers),README.md.call_agent_asyncfunction not provided google/adk-python#93: open-dictmaps are a structured-output hazard (Branch.routes→list[Route]); Geminiresponse_schemarejectsField(discriminator=...)so the vocabulary is a plainkind-tagged union; planning vs capability quality are separable; the pattern sweep surfaced a vocabulary gap — tournament needs loop-carried state, henceLoopUntil.init(binding the loop's own id withoutinitis a validation error).independence_facts()derives the provenance statements ("stageverifiersees ONLY stagereviewer's per-item output") the frozen record proves to an auditor.root_agent.yaml(DESIGN §11, per reviewer feedback on Tutorial: Updatedcall_agent_asyncfunction not provided google/adk-python#93): static graph skeletons should lower/export toward thecontributing/samples/workflows/loop_config/root_agent.yamlstyle (agent_class: Workflow, staticedges, child YAML), whileWorkflowSpecremains the model-facing source of truth. Raw YAML is not the planner output becauseloop_configintentionally resolves Python function refs (.agent.route_headline),_coderefs, child config paths, tools/callbacks, and possibly FQNs.branch/runtimefan_out/pipelinestay new typed blocks because config does not directly express runtime per-item dispatch / barrier-free multi-stage flow; YAML would need a wrapper node. Caveat:Workflowitself is not deprecated, but the current config loader path and agent-config sugar classes are@deprecated+@experimental; this is convergence with the Workflow config shape for compatibility, not a long-term dependency on today's loader or deprecated sugar.3.
authored_workflow_demo/— ADK Web demo wrapperA discoverable
root_agent(aWorkflow) that exposes the flow in ADK Web — authors (scripted or free-form — thefreely/decomposetrigger gives the planner only goal + capability descriptions, no recipe) → validates → independence-lints → freezes-to-state → exports → lowers static config shape → executes with a cost line (capability dispatches + planner calls), surfacing each step as a chat message (authored plan, validation, independence facts, capabilities, frozen hash, exported plan, config projection, output, cost).security_audit_planner/agent.py,test_demo_agent.py(8 CI-safe tests, incl. a no-LLM reuse-path test + dispatch-count assertion, the lints/independence-facts check, the quality-gate (self-review rejection) shape pin, the recipe-free free-authoring instruction pin, and the config-lowering assertion),README.md(the ~7-min recording script),DEMO_NARRATIVE.md(beat-by-beat narration from a real run).adk web contributing/samples/workflows/authored_workflow_demo.4.
authored_workflow_ca_demo/— BigQuery Conversational Analytics plannerOne agent, seven prompts, seven authored shapes — the pattern-coverage table made tangible in a real product surface (styled after BigQuery Conversational Analytics). Query execution is REAL BigQuery when credentials allow:
dry_runhits the actual dry-run API (real errors + bytes-scanned, with the user question preserved through the loop-carried value so repair rounds get full context) andrun_queryexecutes againstbigquery-public-data.thelook_ecommerce(billed toGOOGLE_CLOUD_PROJECT;maximum_bytes_billed= 2 GB/query, 500-row cap; a failing query returns the real error — never fabricated fallback rows). Without credentials, a deterministic micro-warehouse (synthetic facts + SQL-intent aggregation incl. month/quarter/year grains) takes over; every beat carries anenginefield. The default ask-a-question flow is the real CA shape: loop_until(draft → real dry-run → repair from the real error) → run_query → render_chart + summarize, with the live question as task input (template reuse on re-send). Charts: Vega-Lite artifact + inline matplotlib PNG, line inference for time series, multi-series (one line per region) for two-dimensional results. Cross-session reuse: frozen plans export their fullFrozenWorkflowRecordto a plan store and new sessions import them through the defensive path (hash/registry/declared-contract checks; new questions validated against the capturedtask_input_schema); the typed object-output capabilities declare output models so contract hashes have teeth (primitive helpers —sql_ok,pair_charts,judge_chart,single_chart— return bare values and rely on manual versions).bq_ca_planner/agent.py,test_ca_demo_agent.py(38 collected — 36 CI-safe + 1 live-gated real-BigQuery round-trip + 1 tool-loop sibling-isolation proof (now runs against unmodified upstream —isolation_scopeis fixed inmain): all seven shapes validated + lint-clean + executed end-to-end with language capabilities stubbed, engine grains/windows/filters, qualify/jsonify, no-fabrication contract, question-preservation through failed dry runs, multi-series chart pivot, cross-session store round-trip/tamper/drift, the conversational intent gate incl. a no-LLM end-to-end escape-path test, live-insight audits with last-insight memory, the DATA-GROUNDED skeptic (real query tool + output_schema, v2 contract bump), runtime isolation proofs — single-call and tool-loop — and skeptic verdicts with reasons),README.md(prompt table + the 10-turn mixed-workflow conversation script).adk web contributing/samples/workflows/authored_workflow_ca_demo --port 8001.Hygiene
pyink/isort/mdformat/pre-commitclean. Deterministic suites: 11 + 36 + 8 + 38 = 93 collected (gated skips in CI as noted) green. Live tests env-gated (skip withoutSPIKE_LIVE+ project). The fork-onlyagent-triage-pull-requestcheck fails on an emptyGITHUB_TOKEN(non-actionable; won't occur upstream).